This Jupyter notebook demonstrates fitting and tuning complex ARIMA models for time series forecasting. This example includes multiple realistic modeling situations that must be overcome, including: cyclic patterns that are not truly seasonal, a bounded response that requires a log transformation, zeros that break the log transformation, and tuning the ARIMA order with a grid search.
This example uses the Sunspot data set, which includes over two hundred years of monthly sunspot observations. The data are downloaded directly from a GitHub repository as part of the example.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
Work with monthly sunspot data dating back to the 1700s.
url_sunspot = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/monthly-sunspots.csv'
sunspot_df = pd.read_csv( url_sunspot )
sunspot_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2820 entries, 0 to 2819
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Month     2820 non-null   object 
 1   Sunspots  2820 non-null   float64
dtypes: float64(1), object(1)
memory usage: 44.2+ KB
The Month column is a string (object) written as YYYY-MM, as shown below.
sunspot_df
| | Month | Sunspots |
|---|---|---|
| 0 | 1749-01 | 58.0 |
| 1 | 1749-02 | 62.6 |
| 2 | 1749-03 | 70.0 |
| 3 | 1749-04 | 55.7 |
| 4 | 1749-05 | 85.0 |
| ... | ... | ... |
| 2815 | 1983-08 | 71.8 |
| 2816 | 1983-09 | 50.3 |
| 2817 | 1983-10 | 55.8 |
| 2818 | 1983-11 | 33.3 |
| 2819 | 1983-12 | 33.4 |
2820 rows × 2 columns
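As a quick aside, pandas can parse this YYYY-MM string format directly, defaulting the day to the first of the month. A tiny sketch:

```python
import pandas as pd

# pandas infers the 'YYYY-MM' format and fills in the first day of the month
ts = pd.to_datetime('1749-01')
print(ts)  # 1749-01-01 00:00:00
```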
Create the datetime object using the pd.to_datetime() function.
sunspot_df['date_dt'] = pd.to_datetime( sunspot_df.Month )
sunspot_df
| | Month | Sunspots | date_dt |
|---|---|---|---|
| 0 | 1749-01 | 58.0 | 1749-01-01 |
| 1 | 1749-02 | 62.6 | 1749-02-01 |
| 2 | 1749-03 | 70.0 | 1749-03-01 |
| 3 | 1749-04 | 55.7 | 1749-04-01 |
| 4 | 1749-05 | 85.0 | 1749-05-01 |
| ... | ... | ... | ... |
| 2815 | 1983-08 | 71.8 | 1983-08-01 |
| 2816 | 1983-09 | 50.3 | 1983-09-01 |
| 2817 | 1983-10 | 55.8 | 1983-10-01 |
| 2818 | 1983-11 | 33.3 | 1983-11-01 |
| 2819 | 1983-12 | 33.4 | 1983-12-01 |
2820 rows × 3 columns
The series shows a cyclic pattern, but we do not know if this is a seasonal pattern! We cannot say for certain that the patterns repeat every year. The sun goes through a solar cycle which lasts roughly 10 to 12 years. Thus, the cyclical pattern shown below is not a true seasonal pattern that repeats every year. This is a complex trend-cycle!
sns.set_style('whitegrid')
sns.relplot( data = sunspot_df, x = 'date_dt', y = 'Sunspots', kind='line', aspect=2.5)
plt.show()
Let's use data manipulation techniques to help visualize the data. We will extract the year and cut by the century.
sunspot_df['the_year'] = sunspot_df.date_dt.dt.year
Use the pd.cut() function to create the century "bins". We can set include_lowest to be False here because we know that the earliest point in time does not fall directly on the first bin starting point.
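To see how the bin edges behave, here is a toy sketch with made-up years: pd.cut() uses right-closed intervals by default, so a value of exactly 1800 falls in the (1700, 1800] bin, not the next one.

```python
import pandas as pd

years = pd.Series([1749, 1800, 1801, 1983])

# right-closed intervals: (1700, 1800], (1800, 1900], (1900, 2000]
bins = pd.cut(years, bins=[1700, 1800, 1900, 2000], include_lowest=False)
print(bins.astype(str).tolist())
```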
sunspot_df['century'] = pd.cut( sunspot_df.the_year, bins=[1700, 1800, 1900, 2000], include_lowest=False)
sunspot_df
| | Month | Sunspots | date_dt | the_year | century |
|---|---|---|---|---|---|
| 0 | 1749-01 | 58.0 | 1749-01-01 | 1749 | (1700, 1800] |
| 1 | 1749-02 | 62.6 | 1749-02-01 | 1749 | (1700, 1800] |
| 2 | 1749-03 | 70.0 | 1749-03-01 | 1749 | (1700, 1800] |
| 3 | 1749-04 | 55.7 | 1749-04-01 | 1749 | (1700, 1800] |
| 4 | 1749-05 | 85.0 | 1749-05-01 | 1749 | (1700, 1800] |
| ... | ... | ... | ... | ... | ... |
| 2815 | 1983-08 | 71.8 | 1983-08-01 | 1983 | (1900, 2000] |
| 2816 | 1983-09 | 50.3 | 1983-09-01 | 1983 | (1900, 2000] |
| 2817 | 1983-10 | 55.8 | 1983-10-01 | 1983 | (1900, 2000] |
| 2818 | 1983-11 | 33.3 | 1983-11-01 | 1983 | (1900, 2000] |
| 2819 | 1983-12 | 33.4 | 1983-12-01 | 1983 | (1900, 2000] |
2820 rows × 5 columns
Extract the month from the date time object.
sunspot_df['the_month'] = sunspot_df.date_dt.dt.month
Plot the sunspot activity per month across all years, grouped by the century.
sns.relplot(data = sunspot_df, x='the_month', y='Sunspots', kind='line', estimator=None,
            units='the_year', col='century', alpha=0.25)
plt.show()
Summarize within each month across all years in each century.
sns.catplot(data = sunspot_df, x='the_month', y='Sunspots', kind='violin', col='century')
plt.show()
Wait...why do the violin plots show negative values? Do we have negative values in the data? No!
sunspot_df.describe()
| | Sunspots | date_dt | the_year | the_month |
|---|---|---|---|---|
| count | 2820.000000 | 2820 | 2820.000000 | 2820.000000 |
| mean | 51.265957 | 1866-06-16 10:38:17.872340992 | 1866.000000 | 6.500000 |
| min | 0.000000 | 1749-01-01 00:00:00 | 1749.000000 | 1.000000 |
| 25% | 15.700000 | 1807-09-23 12:00:00 | 1807.000000 | 3.750000 |
| 50% | 42.000000 | 1866-06-16 00:00:00 | 1866.000000 | 6.500000 |
| 75% | 74.925000 | 1925-03-08 18:00:00 | 1925.000000 | 9.250000 |
| max | 253.800000 | 1983-12-01 00:00:00 | 1983.000000 | 12.000000 |
| std | 43.448971 | NaN | 67.850074 | 3.452665 |
The violin plot is visualizing the distributional shape with a kernel density estimate. It's just approximating the behavior, and sometimes the default settings aren't the best.
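To see why the density can leak past the boundary of the data, here is a minimal hand-rolled Gaussian KDE sketch (made-up data and bandwidth): each observation contributes a Gaussian bump, so nonzero density appears below zero even though no observation is negative.

```python
import numpy as np

data = np.array([0.0, 0.1, 0.5, 2.0])   # all non-negative
h = 0.5                                  # kernel bandwidth

def kde(x):
    # average of Gaussian kernels centered on each observation
    return np.mean(np.exp(-0.5 * ((x - data) / h) ** 2)) / (h * np.sqrt(2.0 * np.pi))

print(kde(-0.5))  # positive, even though no data point is below zero
```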
sns.displot( data = sunspot_df, x='Sunspots', kind='kde' )
plt.show()
sns.displot( data = sunspot_df, x='Sunspots', hue='century', kind='kde', common_norm=False )
plt.show()
We can adjust the settings of the kernel density estimate by modifying the bandwidth. Small values can be susceptible to noise in the data.
sns.displot( data = sunspot_df, x='Sunspots', hue='century', kind='kde', bw_adjust=0.15, common_norm=False )
plt.show()
Large values will overly smooth the data.
sns.displot( data = sunspot_df, x='Sunspots', hue='century', kind='kde', bw_adjust=2.25, common_norm=False )
plt.show()
As we smooth the data more and more, we see that the shape starts to look more Gaussian. Normal distributions are not appropriate when variables are bounded. We know that the sunspot activity cannot be negative. So it would be incorrect to use a Gaussian likelihood with this data because we cannot have negative values!
One way to deal with this issue is to transform the response. Rather than working with the response directly, we will instead model a transformed variable that can be both positive and negative. A simple way to do this is with the natural log. Log transformations reduce skew and make distributions more symmetric, and thus the transformed values are more appropriately modeled with Gaussians.
However, we cannot apply the natural log function to the data! The natural log of zero is -infinity! A simple "hack" around this is to add in a small constant to our data before transforming. Adding in the constant does not change the shape of the distribution. Let's check the minimum value greater than zero.
sunspot_df.loc[ sunspot_df.Sunspots > 0, :].Sunspots.min()
0.1
A rule of thumb is to add a fraction of the smallest non-zero value, so let's add in 0.05 to the Sunspots variable and then take the natural log.
sunspot_df['log_sunspot'] = np.log( sunspot_df.Sunspots + 0.05 )
Visualize the distribution of the natural log of Sunspots.
sns.displot( data = sunspot_df, x='log_sunspot', kind='kde' )
plt.show()
The natural log transformation "stretches" the scale of the low values which would have been "clouded" by the larger values in the original scale.
sns.displot( data = sunspot_df, x='log_sunspot', kind='kde', hue='century', common_norm=False )
plt.show()
Let's now plot the log-transformed values over time. We can see the "zero" behavior as the odd-looking downward spikes in the log-transformed space.
sns.relplot( data = sunspot_df, x = 'date_dt', y = 'log_sunspot', kind='line', aspect=2.5)
plt.show()
Let's separate the log-transformed values as a Pandas series and set the DateTimeIndex appropriately.
sunspot_series = sunspot_df.log_sunspot
sunspot_series.index = sunspot_df.date_dt
The time series is resampled to force a Month Start. The .sum() method is applied to add together multiple measurements within a month, if multiple measurements exist.
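As a toy illustration of what .resample('MS').sum() does (made-up dates and values): observations within a month are added together, and the resulting index falls on the first of each month.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0],
              index=pd.to_datetime(['2000-01-15', '2000-01-20', '2000-02-01']))

# 'MS' bins by month and labels each bin with the month start
monthly = s.resample('MS').sum()
print(monthly)
```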
sunspot_ready = sunspot_series.copy().resample('MS').sum()
sunspot_ready
date_dt
1749-01-01 4.061305
1749-02-01 4.137564
1749-03-01 4.249209
1749-04-01 4.020877
1749-05-01 4.443239
...
1983-08-01 4.274581
1983-09-01 3.918999
1983-10-01 4.022670
1983-11-01 3.507058
1983-12-01 3.510052
Freq: MS, Name: log_sunspot, Length: 2820, dtype: float64
The time series plot is just like our previous visualization.
sunspot_ready.plot( figsize=(15, 6) )
plt.show()
Next, consider the lag plot associated with several lags. We can see the impact of the "zeros" in the subplots below.
lags_use = [1, 2, 6, 12, 36, 72, 120, 132, 260]
fig, ax = plt.subplots(3, 3, figsize=(12, 12), sharex=True, sharey=True)
ax = ax.ravel()
for k in range(len(lags_use)):
    pd.plotting.lag_plot( sunspot_ready, lag=lags_use[k], ax=ax[k] )
    ax[k].plot( ax[k].get_xlim(), ax[k].get_ylim(), 'k--')
    ax[k].set_title('lag: ' + str(lags_use[k]) )
plt.show()
The autocorrelation plot reveals the series is highly autocorrelated!
fig, ax = plt.subplots( figsize = (12, 8) )
sm.graphics.tsa.plot_acf( sunspot_ready.values.squeeze(), lags=360, ax = ax)
plt.show()
We can also examine the partial autocorrelation, which measures the remaining autocorrelation at a given lag after controlling for the correlations at all shorter lags.
fig, ax = plt.subplots( figsize = (12, 8) )
sm.graphics.tsa.plot_pacf( sunspot_ready.values.squeeze(), lags=360, ax = ax)
plt.show()
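To build intuition for the difference between the two plots, consider a simulated AR(1) sketch (synthetic data, coefficient chosen arbitrarily): the ACF decays geometrically as $\phi^k$, while the partial autocorrelation is essentially zero beyond lag 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, phi = 50_000, 0.8

# simulate y_t = phi * y_{t-1} + e_t
y = np.zeros(n)
e = rng.normal(size=n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + e[t]

def sample_acf(x, k):
    x = x - x.mean()
    return np.dot(x[:-k], x[k:]) / np.dot(x, x)

rho1, rho2 = sample_acf(y, 1), sample_acf(y, 2)

# ACF decays geometrically: rho1 ~ 0.8, rho2 ~ 0.64
# lag-2 partial autocorrelation: (rho2 - rho1**2) / (1 - rho1**2) ~ 0
pacf2 = (rho2 - rho1 ** 2) / (1 - rho1 ** 2)
print(rho1, rho2, pacf2)
```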
Our visualizations have revealed this is a rather complex data set. Instead of using the classic additive decomposition, let's jump straight to the STL decomposition.
from statsmodels.tsa.seasonal import STL
sunspot_stl = STL( sunspot_ready )
sunspot_stl_fit = sunspot_stl.fit()
plt.rcParams['figure.figsize'] = 18,8
fig = sunspot_stl_fit.plot()
It does not look like there are any clear seasonal trends in the data! The data appear to have trend-cycles that vary over time.
Let's split the data set into a training set and a hold-out set. This will give us a sense of how well our model performs at forecasting future behavior. Let's treat the last 11 years as the hold-out set.
sunspot_train = sunspot_ready.loc[ sunspot_ready.index < '1973-01-01' ].copy()
fig, ax = plt.subplots( figsize=(15, 6) )
sunspot_ready.plot( ax = ax, label='all' )
sunspot_train.plot( ax = ax, label='train' )
ax.legend()
plt.show()
from statsmodels.tsa.arima.model import ARIMA
With ARIMA modeling, we must specify the order for the AutoRegressive (AR) lags, the degree of differencing, and the order of the Moving Average (MA) terms. The order is denoted as $(p,d,q)$. Let's start by first considering $(0,0,1)$; thus we are using an MA of order 1.
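To build intuition for what an MA($q=1$) term means, here is a small synthetic sketch (coefficient chosen arbitrarily): the series is a noise term plus a fraction of the previous noise term, and its theoretical lag-1 autocorrelation is $\theta / (1 + \theta^2)$, with zero autocorrelation beyond lag 1.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.6
eps = rng.normal(size=100_000)

# MA(1): y_t = eps_t + theta * eps_{t-1}
y = eps[1:] + theta * eps[:-1]

def sample_acf(x, k):
    x = x - x.mean()
    return np.dot(x[:-k], x[k:]) / np.dot(x, x)

print(sample_acf(y, 1))  # ~ theta / (1 + theta**2) = 0.441...
print(sample_acf(y, 2))  # ~ 0
```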
There are several options to the ARIMA() function that control the constraints on the model coefficients. The enforce_stationarity flag controls the constraints on the AR coefficients, and the enforce_invertibility flag controls the constraints on the MA coefficients. Enforcing a stationary series requires the AR coefficient to be between -1 and +1 for $\mathrm{AR}(p=1)$ (the constraints are more complicated for higher order AR models). Enforcing an invertible series likewise requires the MA coefficient to be between -1 and +1 for $\mathrm{MA}(q=1)$ (again, the constraints are more complicated for higher order MA models). Invertibility means that more recent data have a greater influence on the forecasts than data from the distant past. If the coefficients could be greater than 1 in absolute value, then data in the distant past could have a greater influence on future behavior.
For those reasons we will force the series to be invertible during the fitting of the MA terms.
sunspots_ma_1_model_fit = ARIMA( sunspot_train, order=(0, 0, 1), enforce_invertibility=True).fit()
print( sunspots_ma_1_model_fit.summary() )
SARIMAX Results
==============================================================================
Dep. Variable: log_sunspot No. Observations: 2688
Model: ARIMA(0, 0, 1) Log Likelihood -4241.981
Date: Tue, 30 Mar 2021 AIC 8489.963
Time: 14:02:07 BIC 8507.652
Sample: 01-01-1749 HQIC 8496.361
- 12-01-1972
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 3.2970 0.056 58.446 0.000 3.186 3.408
ma.L1 0.5894 0.008 75.511 0.000 0.574 0.605
sigma2 1.3746 0.027 50.632 0.000 1.321 1.428
===================================================================================
Ljung-Box (L1) (Q): 309.85 Jarque-Bera (JB): 9179.01
Prob(Q): 0.00 Prob(JB): 0.00
Heteroskedasticity (H): 0.64 Skew: -2.35
Prob(H) (two-sided): 0.00 Kurtosis: 10.73
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
The ARIMA model result includes a diagnostic plot. Do you think our MA model satisfies the linear model assumptions?
sunspots_ma_1_model_fit.plot_diagnostics(figsize=(16, 8))
plt.show()
Let's forecast the hold-out set and see how the predictions compare to the observations. The data are sampled monthly, so 11 years with 12 months per year corresponds to 11 * 12 = 132 forecast steps.
sunspots_ma_1_model_forecast = sunspots_ma_1_model_fit.forecast(132)
plt.figure( figsize=(15, 6) )
plt.plot( sunspot_ready, color='steelblue' )
plt.plot( sunspot_train, color='orange' )
plt.plot( sunspots_ma_1_model_fit.fittedvalues, color='black', linestyle='--')
plt.plot( sunspots_ma_1_model_forecast, color='black' )
plt.show()
Let's define a function to help with the forecasts. The function will return the forecasted mean and prediction interval.
def make_forecast(my_mod, start_time, final_time):
    forecast_object = my_mod.get_prediction(start_time, final_time)
    forecast_mean = forecast_object.predicted_mean
    forecast_ci = forecast_object.conf_int()
    res = pd.DataFrame({'forecast_mean': forecast_mean,
                        'forecast_lwr': forecast_ci.iloc[:, 0],
                        'forecast_upr': forecast_ci.iloc[:, 1]},
                       index=forecast_ci.index)
    return res
And now let's forecast the hold-out set returning the mean and prediction interval as well.
h_start = sunspots_ma_1_model_forecast.index.min()
h_final = sunspots_ma_1_model_forecast.index.max()
sunspot_ma_1_forecast_summary = make_forecast( sunspots_ma_1_model_fit, h_start, h_final )
sunspot_ma_1_forecast_summary
| | forecast_mean | forecast_lwr | forecast_upr |
|---|---|---|---|
| 1973-01-01 | 3.564141 | 1.266181 | 5.862102 |
| 1973-02-01 | 3.297035 | 0.629607 | 5.964463 |
| 1973-03-01 | 3.297035 | 0.629607 | 5.964463 |
| 1973-04-01 | 3.297035 | 0.629607 | 5.964463 |
| 1973-05-01 | 3.297035 | 0.629607 | 5.964463 |
| ... | ... | ... | ... |
| 1983-08-01 | 3.297035 | 0.629607 | 5.964463 |
| 1983-09-01 | 3.297035 | 0.629607 | 5.964463 |
| 1983-10-01 | 3.297035 | 0.629607 | 5.964463 |
| 1983-11-01 | 3.297035 | 0.629607 | 5.964463 |
| 1983-12-01 | 3.297035 | 0.629607 | 5.964463 |
132 rows × 3 columns
Let's visualize the results below.
fig, ax = plt.subplots( figsize=(15, 6) )
ax.plot( sunspot_ready, color='steelblue' )
ax.plot( sunspot_train, color='orange' )
ax.fill_between( sunspot_ma_1_forecast_summary.index,
sunspot_ma_1_forecast_summary.forecast_lwr,
sunspot_ma_1_forecast_summary.forecast_upr,
color = 'orange',
alpha=0.33)
ax.plot( sunspot_ma_1_forecast_summary.forecast_mean, color='red' )
ax.plot( sunspots_ma_1_model_fit.fittedvalues, color='black', linestyle='--')
plt.show()
The model clearly is not that great...but what if we want the forecasts in the original output space, not the log-transformed space? We must undo or back-transform!
VERY IMPORTANT: we can apply the back-transformation to quantiles. Thus, back-transforming the prediction interval in the log-space gives us the prediction interval in the original space. HOWEVER, back-transforming the predicted mean in the log-space does NOT give us the predicted mean in the original space. Instead, we get the predicted or forecasted MEDIAN. The mean of a Gaussian equals its median, so exponentiating the log-space mean actually undoes the transformation on the median. Properly back-transforming the mean takes a little extra effort that we will not go into here.
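A quick simulated check of this point (synthetic log-space samples, made-up parameters): exponentiating the log-space mean recovers the original-space median, while the original-space mean is larger and requires the lognormal correction $\exp(\mu + \sigma^2/2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 1.0

z = rng.normal(mu, sigma, size=200_000)  # log-space samples
y = np.exp(z)                            # back-transformed samples

print(np.exp(mu), np.median(y))                  # the median is recovered
print(np.exp(mu + 0.5 * sigma ** 2), y.mean())   # the mean needs the correction
```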
Let's define a function which back-transforms the data. The function accepts a constant, a, which represents the small value we added to the zeros. The back-transformation of the log-transformed quantity, $z$, to the original response space, $y$, is $y = \exp(z) - a$:
def undo_log_transform(z, a):
    return np.exp(z) - a
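A quick round-trip sanity check (the function is repeated here so the snippet is self-contained, with a few made-up values including zero):

```python
import numpy as np

def undo_log_transform(z, a):
    return np.exp(z) - a

y = np.array([0.0, 0.1, 58.0, 253.8])
z = np.log(y + 0.05)               # forward transform with the added constant

recovered = undo_log_transform(z, 0.05)
print(np.allclose(recovered, y))   # True
```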
Let's visualize the back-transformed forecasts, but remember we are actually comparing the forecasted median to the observed values.
fig, ax = plt.subplots( figsize=(15, 6) )
ax.plot( sunspot_df.date_dt, sunspot_df.Sunspots, color='steelblue' )
ax.plot( sunspot_train.index,
undo_log_transform(sunspot_train.values, 0.05), color='orange' )
ax.fill_between( sunspot_ma_1_forecast_summary.index,
undo_log_transform(sunspot_ma_1_forecast_summary.forecast_lwr, 0.05),
undo_log_transform(sunspot_ma_1_forecast_summary.forecast_upr, 0.05),
color = 'orange',
alpha=0.33)
ax.plot( sunspot_ma_1_forecast_summary.index,
undo_log_transform(sunspot_ma_1_forecast_summary.forecast_mean, 0.05), color='red' )
ax.plot( sunspot_train.index,
undo_log_transform(sunspots_ma_1_model_fit.fittedvalues, 0.05),
color='black', linestyle='--')
plt.show()
Our order-1 MA model isn't that great.
What if we try an AR model with order 1? We will enforce stationarity of the series for the fitting.
sunspots_ar_1_model_fit = ARIMA( sunspot_train, order=(1, 0, 0), enforce_stationarity=True).fit()
print( sunspots_ar_1_model_fit.summary() )
SARIMAX Results
==============================================================================
Dep. Variable: log_sunspot No. Observations: 2688
Model: ARIMA(1, 0, 0) Log Likelihood -3406.061
Date: Tue, 30 Mar 2021 AIC 6818.123
Time: 14:02:08 BIC 6835.812
Sample: 01-01-1749 HQIC 6824.521
- 12-01-1972
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 3.2992 0.192 17.146 0.000 2.922 3.676
ar.L1 0.8321 0.009 90.287 0.000 0.814 0.850
sigma2 0.7378 0.010 73.604 0.000 0.718 0.757
===================================================================================
Ljung-Box (L1) (Q): 219.05 Jarque-Bera (JB): 19777.55
Prob(Q): 0.00 Prob(JB): 0.00
Heteroskedasticity (H): 0.64 Skew: -1.30
Prob(H) (two-sided): 0.00 Kurtosis: 16.03
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
How does this model behave?
sunspots_ar_1_model_fit.plot_diagnostics(figsize=(16, 8))
plt.show()
Let's forecast the AR model and compare to the observed values in the hold-out set.
sunspot_ar_1_forecast_summary = make_forecast( sunspots_ar_1_model_fit, h_start, h_final )
fig, ax = plt.subplots( figsize=(15, 6) )
ax.plot( sunspot_ready, color='steelblue' )
ax.plot( sunspot_train, color='orange' )
ax.fill_between( sunspot_ar_1_forecast_summary.index,
                 sunspot_ar_1_forecast_summary.forecast_lwr,
                 sunspot_ar_1_forecast_summary.forecast_upr,
color = 'orange',
alpha=0.33)
ax.plot( sunspot_ar_1_forecast_summary.forecast_mean, color='red' )
ax.plot( sunspots_ar_1_model_fit.fittedvalues, color='black', linestyle='--')
plt.show()
That still does not seem to capture the future variation!
Let's undo the transformation to see the mismatch in the forecasts in the raw data space.
fig, ax = plt.subplots( figsize=(15, 6) )
ax.plot( sunspot_df.date_dt, sunspot_df.Sunspots, color='steelblue' )
ax.plot( sunspot_train.index,
undo_log_transform(sunspot_train.values, 0.05), color='orange' )
ax.fill_between( sunspot_ar_1_forecast_summary.index,
undo_log_transform(sunspot_ar_1_forecast_summary.forecast_lwr, 0.05),
undo_log_transform(sunspot_ar_1_forecast_summary.forecast_upr, 0.05),
color = 'orange',
alpha=0.33)
ax.plot( sunspot_ar_1_forecast_summary.index,
undo_log_transform(sunspot_ar_1_forecast_summary.forecast_mean, 0.05), color='red' )
ax.plot( sunspot_train.index,
undo_log_transform(sunspots_ar_1_model_fit.fittedvalues, 0.05),
color='black', linestyle='--')
plt.show()
We can keep trying out various orders for the ARIMA model. Instead of doing this manually, let's iterate with a simple grid search over many possible $p$, $d$, and $q$ values. This way we can programmatically determine if differencing the series helps with the forecasts, rather than having to manually find the best differencing to use. That said, we should already have some sense that we should difference the time series given how autocorrelated it is.
We will try out many combinations of the ARIMA $p$, $d$, and $q$ parameters. We can therefore view the problem as 3 nested for-loops! However, instead of writing nested loops, we will use itertools functionality to create a grid of possible values and then iterate over that grid with a single for loop.
import itertools
Let's specify the candidate values to consider for the lags, differencing, and MA order.
p_grid = [0, 1, 3, 5, 7, 9, 11, 13, 15, 17]
d_grid = range(0, 4)
q_grid = range(0, 4)
Create the full-factorial product of all combinations of the three tuning parameters.
pdq = list( itertools.product(p_grid, d_grid, q_grid) )
len( pdq )
160
So we will try out 160 models! The first 3 combinations are pure moving average orders (the very first, (0, 0, 0), is just a constant-plus-noise model).
print( pdq[0] )
print( pdq[1] )
print( pdq[2] )
(0, 0, 0)
(0, 0, 1)
(0, 0, 2)
Fitting the ARIMA model is rather complex, and during the tuning grid search some of the models might fail to converge. So we will use simple try/except error handling that forces a continue should an exception be raised. We previously tuned AR models by focusing on the AIC. Thus, it might seem like a good approach to extract and store the .aic attribute after fitting each model.
pdq_results = []
aic_results = []

for param in pdq:
    try:
        mod_fit = ARIMA( sunspot_train,
                         order=param,
                         enforce_stationarity=True,
                         enforce_invertibility=True).fit(method_kwargs={'maxiter':1501})
        pdq_results.append(param)
        aic_results.append(mod_fit.aic)
        print('ARIMA{} - AIC:{}'.format(param, mod_fit.aic))
    except Exception:
        # some (p, d, q) combinations may fail to converge; skip them
        continue
ARIMA(0, 0, 0) - AIC:9987.824437622558
ARIMA(0, 0, 1) - AIC:8489.96275750978
ARIMA(0, 0, 2) - AIC:7751.209776058336
ARIMA(0, 0, 3) - AIC:7456.636702475918
ARIMA(0, 1, 0) - AIC:7046.44342450552
ARIMA(0, 1, 1) - AIC:6320.8680827667995
ARIMA(0, 1, 2) - AIC:6311.127252676057
ARIMA(0, 1, 3) - AIC:6312.668270585105
ARIMA(0, 2, 0) - AIC:9807.024807168269
UserWarning: Non-invertible starting MA parameters found. Using zeros as starting parameters.
ARIMA(0, 2, 1) - AIC:7054.722585115853
ARIMA(0, 2, 2) - AIC:6331.251608311366
ARIMA(0, 2, 3) - AIC:6321.634052138928
ARIMA(0, 3, 0) - AIC:12979.6888203351
ARIMA(0, 3, 1) - AIC:9814.291819509977
ARIMA(0, 3, 2) - AIC:7143.535340957581
ARIMA(0, 3, 3) - AIC:6368.050781567876
ARIMA(1, 0, 0) - AIC:6818.122570845275
ARIMA(1, 0, 1) - AIC:6293.342356899739
ARIMA(1, 0, 2) - AIC:6287.225087858638
ARIMA(1, 0, 3) - AIC:6289.173709988376
ARIMA(1, 1, 0) - AIC:6584.575858655601
ARIMA(1, 1, 1) - AIC:6310.950797150101
ARIMA(1, 1, 2) - AIC:6312.936379174737
ARIMA(1, 1, 3) - AIC:6312.921101745471
ARIMA(1, 2, 0) - AIC:8443.00389679542
ARIMA(1, 2, 1) - AIC:6593.694590274309
ARIMA(1, 2, 2) - AIC:6321.475468241593
ARIMA(1, 2, 3) - AIC:6334.968395175995
ARIMA(1, 3, 0) - AIC:10889.15738637095
ARIMA(1, 3, 1) - AIC:8451.73991200588
ARIMA(1, 3, 2) - AIC:6616.196446149028
ARIMA(1, 3, 3) - AIC:6364.8040510475985
ARIMA(3, 0, 0) - AIC:6383.4695794754
ARIMA(3, 0, 1) - AIC:6289.056362900776
ARIMA(3, 0, 2) - AIC:6251.718254585794
ARIMA(3, 0, 3) - AIC:6282.819592189822
ARIMA(3, 1, 0) - AIC:6348.505644906983
ARIMA(3, 1, 1) - AIC:6309.749153001626
ARIMA(3, 1, 2) - AIC:6311.519544924653
ARIMA(3, 1, 3) - AIC:6302.498348652645
ARIMA(3, 2, 0) - AIC:7470.305109158288
ARIMA(3, 2, 1) - AIC:6358.4676779182755
ARIMA(3, 2, 2) - AIC:6320.201675494642
ARIMA(3, 2, 3) - AIC:6321.979348795429
ARIMA(3, 3, 0) - AIC:9263.03913060096
ARIMA(3, 3, 1) - AIC:7480.7664489182325
ARIMA(3, 3, 2) - AIC:6382.791474224576
ARIMA(3, 3, 3) - AIC:6483.900073204575
ARIMA(5, 0, 0) - AIC:6289.259299631957
ARIMA(5, 0, 1) - AIC:6286.101512428845
ARIMA(5, 0, 2) - AIC:6288.03236700828
UserWarning: Non-stationary starting autoregressive parameters found. Using zeros as starting parameters.
ARIMA(5, 0, 3) - AIC:6279.835770714513
ARIMA(5, 1, 0) - AIC:6322.578465779294
ARIMA(5, 1, 1) - AIC:6313.181329423933
ARIMA(5, 1, 2) - AIC:6314.161111026417
ARIMA(5, 1, 3) - AIC:6288.842340065792
ARIMA(5, 2, 0) - AIC:7033.216735962961
ARIMA(5, 2, 1) - AIC:6332.803131881603
ARIMA(5, 2, 2) - AIC:6336.623618331774
ARIMA(5, 2, 3) - AIC:6325.453907140483
ARIMA(5, 3, 0) - AIC:8313.96621562453
ARIMA(5, 3, 1) - AIC:7044.806800202086
ARIMA(5, 3, 2) - AIC:6354.5726707201575
ARIMA(5, 3, 3) - AIC:6363.268225329409
ARIMA(7, 0, 0) - AIC:6285.09866572496
ARIMA(7, 0, 1) - AIC:6283.602511847928
ARIMA(7, 0, 2) - AIC:6268.048309891728
ARIMA(7, 0, 3) - AIC:6281.395814389889
ARIMA(7, 1, 0) - AIC:6302.541759568325
ARIMA(7, 1, 1) - AIC:6304.408240003564
ARIMA(7, 1, 2) - AIC:6297.468864000002
ARIMA(7, 1, 3) - AIC:6298.63259029843
ARIMA(7, 2, 0) - AIC:6814.14444833338
ARIMA(7, 2, 1) - AIC:6313.033135227663
ARIMA(7, 2, 2) - AIC:6314.885171770407
ARIMA(7, 2, 3) - AIC:6316.705979707178
ARIMA(7, 3, 0) - AIC:7903.385594237556
ARIMA(7, 3, 1) - AIC:6826.549199594237
ARIMA(7, 3, 2) - AIC:6342.380411181403
ARIMA(7, 3, 3) - AIC:6353.741608888586
ARIMA(9, 0, 0) - AIC:6280.017938700563
ARIMA(9, 0, 1) - AIC:6282.1249446953325
ARIMA(9, 0, 2) - AIC:6284.061092716751
ARIMA(9, 0, 3) - AIC:6272.68591242039
ARIMA(9, 1, 0) - AIC:6306.392471080824
ARIMA(9, 1, 1) - AIC:6305.591060367133
ARIMA(9, 1, 2) - AIC:6307.8680702350775
ARIMA(9, 1, 3) - AIC:6300.417920230679
ARIMA(9, 2, 0) - AIC:6639.503638082074
ARIMA(9, 2, 1) - AIC:6316.863817635507
ARIMA(9, 2, 2) - AIC:6318.722344941251
ARIMA(9, 2, 3) - AIC:6308.828250075861
ARIMA(9, 3, 0) - AIC:7572.880092487039
ARIMA(9, 3, 1) - AIC:6652.638974966084
ARIMA(9, 3, 2) - AIC:6340.367270303622
ARIMA(9, 3, 3) - AIC:6343.923165186269
ARIMA(11, 0, 0) - AIC:6274.175810583336
ARIMA(11, 0, 1) - AIC:6275.470912595894
ARIMA(11, 0, 2) - AIC:6274.456318191153
ARIMA(11, 0, 3) - AIC:6275.352778928077
ARIMA(11, 1, 0) - AIC:6304.5096027803565
ARIMA(11, 1, 1) - AIC:6306.53955017254
ARIMA(11, 1, 2) - AIC:6300.598408000983
ARIMA(11, 1, 3) - AIC:6302.436766519213
ARIMA(11, 2, 0) - AIC:6586.429087727425
ARIMA(11, 2, 1) - AIC:6314.872430989189
ARIMA(11, 2, 2) - AIC:6314.439408691945
ARIMA(11, 2, 3) - AIC:6311.922117667521
ARIMA(11, 3, 0) - AIC:7293.192172468765
ARIMA(11, 3, 1) - AIC:6599.965001116852
ARIMA(11, 3, 2) - AIC:6616.674904135811
ARIMA(11, 3, 3) - AIC:6342.2953651249845
ARIMA(13, 0, 0) - AIC:6275.977844417754
ARIMA(13, 0, 1) - AIC:6277.965246346701
ARIMA(13, 0, 2) - AIC:6275.895321073125
ARIMA(13, 0, 3) - AIC:6279.279016710003
ARIMA(13, 1, 0) - AIC:6304.851675292711
ARIMA(13, 1, 1) - AIC:6306.638712912209
ARIMA(13, 1, 2) - AIC:6303.34091398523
ARIMA(13, 1, 3) - AIC:6305.570285823467
ARIMA(13, 2, 0) - AIC:6539.75539956736
ARIMA(13, 2, 1) - AIC:6315.3144626324165
ARIMA(13, 2, 2) - AIC:6317.443960644652
ARIMA(13, 2, 3) - AIC:6312.152415847237
ARIMA(13, 3, 0) - AIC:7173.896323590476
ARIMA(13, 3, 1) - AIC:6553.677573162468
ARIMA(13, 3, 2) - AIC:6586.756857840567
ARIMA(13, 3, 3) - AIC:6588.660999605649
ARIMA(15, 0, 0) - AIC:6277.726956550747
ARIMA(15, 0, 1) - AIC:6243.432661671761
ARIMA(15, 0, 2) - AIC:6251.116763194581
ARIMA(15, 0, 3) - AIC:6252.01621786481
ARIMA(15, 1, 0) - AIC:6307.062067488597
ARIMA(15, 1, 1) - AIC:6301.810223838448
ARIMA(15, 1, 2) - AIC:6301.333165872039
ARIMA(15, 1, 3) - AIC:6300.287766158448
ARIMA(15, 2, 0) - AIC:6488.361818049951
ARIMA(15, 2, 1) - AIC:6317.450841708671
ARIMA(15, 2, 2) - AIC:6316.0428380060075
ARIMA(15, 2, 3) - AIC:6315.460814900571
ARIMA(15, 3, 0) - AIC:7093.626449540539
ARIMA(15, 3, 1) - AIC:6502.690827217652
ARIMA(15, 3, 2) - AIC:6531.271140541649
ARIMA(15, 3, 3) - AIC:6548.1185371525535
ARIMA(17, 0, 0) - AIC:6273.575152565836
ARIMA(17, 0, 1) - AIC:6245.676843897466
ARIMA(17, 0, 2) - AIC:6247.663584958931
ARIMA(17, 0, 3) - AIC:6238.189240041734
ARIMA(17, 1, 0) - AIC:6296.687807699736
ARIMA(17, 1, 1) - AIC:6294.59727938713
ARIMA(17, 1, 2) - AIC:6296.590007670916
ARIMA(17, 1, 3) - AIC:6297.198366675246
ARIMA(17, 2, 0) - AIC:6446.022392847569
ARIMA(17, 2, 1) - AIC:6306.884398261305
ARIMA(17, 2, 2) - AIC:6304.838951187656
ARIMA(17, 2, 3) - AIC:6306.822907774269
ARIMA(17, 3, 0) - AIC:6963.93089922131
ARIMA(17, 3, 1) - AIC:6460.662247862597
ARIMA(17, 3, 2) - AIC:6464.496503628081
ARIMA(17, 3, 3) - AIC:6351.344159314599
Let's compile the results into a DataFrame.
arima_results_df = pd.DataFrame({'p': [pdq_results[n][0] for n in range(len(pdq_results))],
                                 'd': [pdq_results[n][1] for n in range(len(pdq_results))],
                                 'q': [pdq_results[n][2] for n in range(len(pdq_results))],
                                 'AIC': aic_results})
And it might seem like a good idea to find the model with the LOWEST AIC.
arima_results_df.sort_values(['AIC'], ascending=True)
| | p | d | q | AIC |
|---|---|---|---|---|
| 147 | 17 | 0 | 3 | 6238.189240 |
| 129 | 15 | 0 | 1 | 6243.432662 |
| 145 | 17 | 0 | 1 | 6245.676844 |
| 146 | 17 | 0 | 2 | 6247.663585 |
| 130 | 15 | 0 | 2 | 6251.116763 |
| ... | ... | ... | ... | ... |
| 8 | 0 | 2 | 0 | 9807.024807 |
| 13 | 0 | 3 | 1 | 9814.291820 |
| 0 | 0 | 0 | 0 | 9987.824438 |
| 28 | 1 | 3 | 0 | 10889.157386 |
| 12 | 0 | 3 | 0 | 12979.688820 |
160 rows × 4 columns
best_pdq_df = arima_results_df.loc[ arima_results_df.AIC == arima_results_df.AIC.min(), :].copy()
best_pdq_df
| | p | d | q | AIC |
|---|---|---|---|---|
| 147 | 17 | 0 | 3 | 6238.18924 |
It may seem like we have followed everything we did previously. However, that is not actually true. The ar_select_order() function does more than just extract the AIC from each fitted model: it makes sure all models are evaluated on the same amount of data! The manual approach we just executed does not do that! The AR, differencing, and MA orders alter the effective training set size. Thus, we cannot directly compare the AIC values to identify the best model!
Thus, even though ARIMA models share the same Likelihood as the Linear Model we must be more careful when selecting the best ARIMA model based on AIC/BIC!
Why does this matter?
The BEST model according to AIC is the model with the smallest AIC value. However, higher order ARIMA models remove more observations from the training set. Higher order, and therefore more complex, ARIMA models are evaluated on less data. AIC may therefore yield biased results since it is NOT comparing the models on the same data!
How can we overcome this limitation? We will address that shortly. For now, let's continue with the identified "best" model from our incorrect procedure. Please note that the examples in the Rob Hyndman book correctly use AICc to identify the best ARIMA tuning parameters. The functions in the R statistical programming language shown in the book examples calculate AIC/BIC/AICc correctly. Thus, the book examples are similar to the correct procedure used in ar_select_order().
That said, let's fit the "supposed best" ARIMA model and inspect its performance.
best_pdq_tuple = (best_pdq_df.p.to_list()[0], best_pdq_df.d.to_list()[0], best_pdq_df.q.to_list()[0])
sunspots_arima_tune_model_fit = ARIMA( sunspot_train,
order=best_pdq_tuple,
enforce_stationarity=True, enforce_invertibility=True).\
fit(method_kwargs={'maxiter':1501})
sunspots_arima_tune_model_fit.aic
6238.2005461452
Let's now check the model summary and plot the diagnostics.
print( sunspots_arima_tune_model_fit.summary() )
SARIMAX Results
==============================================================================
Dep. Variable: log_sunspot No. Observations: 2688
Model: ARIMA(17, 0, 3) Log Likelihood -3097.095
Date: Tue, 30 Mar 2021 AIC 6238.189
Time: 14:15:57 BIC 6367.913
Sample: 01-01-1749 HQIC 6285.112
- 12-01-1972
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 3.2951 0.243 13.587 0.000 2.820 3.770
ar.L1 0.5673 0.038 15.108 0.000 0.494 0.641
ar.L2 -0.1104 0.038 -2.885 0.004 -0.185 -0.035
ar.L3 1.0282 0.039 26.279 0.000 0.951 1.105
ar.L4 -0.2196 0.024 -9.337 0.000 -0.266 -0.174
ar.L5 -0.0936 0.020 -4.662 0.000 -0.133 -0.054
ar.L6 -0.0699 0.019 -3.655 0.000 -0.107 -0.032
ar.L7 -0.0427 0.017 -2.460 0.014 -0.077 -0.009
ar.L8 0.0102 0.019 0.531 0.595 -0.027 0.048
ar.L9 0.0118 0.019 0.632 0.527 -0.025 0.048
ar.L10 -0.0012 0.020 -0.062 0.951 -0.041 0.038
ar.L11 -0.1115 0.019 -5.961 0.000 -0.148 -0.075
ar.L12 -0.0077 0.019 -0.407 0.684 -0.045 0.029
ar.L13 0.0166 0.025 0.674 0.500 -0.032 0.065
ar.L14 0.0714 0.022 3.178 0.001 0.027 0.116
ar.L15 -0.0069 0.018 -0.391 0.696 -0.041 0.028
ar.L16 -0.0024 0.019 -0.126 0.900 -0.039 0.034
ar.L17 -0.0626 0.016 -3.934 0.000 -0.094 -0.031
ma.L1 -0.1509 0.038 -3.966 0.000 -0.226 -0.076
ma.L2 0.2273 0.034 6.608 0.000 0.160 0.295
ma.L3 -0.8851 0.038 -23.150 0.000 -0.960 -0.810
sigma2 0.5860 0.007 78.418 0.000 0.571 0.601
===================================================================================
Ljung-Box (L1) (Q): 0.01 Jarque-Bera (JB): 19757.86
Prob(Q): 0.93 Prob(JB): 0.00
Heteroskedasticity (H): 0.58 Skew: -1.57
Prob(H) (two-sided): 0.00 Kurtosis: 15.90
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
sunspots_arima_tune_model_fit.plot_diagnostics(figsize=(16, 8))
plt.show()
Now let's forecast the future behavior with the tuned ARIMA model.
sunspot_arima_tune_forecast_summary = make_forecast( sunspots_arima_tune_model_fit, h_start, h_final )
fig, ax = plt.subplots( figsize=(15, 6) )
ax.plot( sunspot_ready, color='steelblue' )
ax.plot( sunspot_train, color='orange' )
ax.fill_between( sunspot_arima_tune_forecast_summary.index,
sunspot_arima_tune_forecast_summary.forecast_lwr,
sunspot_arima_tune_forecast_summary.forecast_upr,
color = 'orange',
alpha=0.33)
ax.plot( sunspot_arima_tune_forecast_summary.forecast_mean, color='red' )
ax.plot( sunspots_arima_tune_model_fit.fittedvalues, color='black', linestyle='--')
plt.show()
And let's back-transform to the original output space.
fig, ax = plt.subplots( figsize=(15, 6) )
ax.plot( sunspot_df.date_dt, sunspot_df.Sunspots, color='steelblue' )
ax.plot( sunspot_train.index,
undo_log_transform(sunspot_train.values, 0.05), color='orange' )
ax.fill_between( sunspot_arima_tune_forecast_summary.index,
undo_log_transform(sunspot_arima_tune_forecast_summary.forecast_lwr, 0.05),
undo_log_transform(sunspot_arima_tune_forecast_summary.forecast_upr, 0.05),
color = 'orange',
alpha=0.33)
ax.plot( sunspot_arima_tune_forecast_summary.index,
undo_log_transform(sunspot_arima_tune_forecast_summary.forecast_mean, 0.05), color='red' )
ax.plot( sunspot_train.index,
undo_log_transform(sunspots_arima_tune_model_fit.fittedvalues, 0.05),
color='black', linestyle='--')
plt.show()
Our tuned model appeared to be the "best" model from the candidate set that we considered. But it's clearly not capturing the future behavior all that well! We know that this problem has cycles, not necessarily seasonal patterns! The sunspot activity has a repeating cycle roughly every 11 years. With just 17 lags we are only going back about 1.5 years!
Let's try to capture the complicated cycle over time with FOURIER terms or harmonic terms. The harmonic series represents behavior as summations of sine and cosine waves with periods set as multiples of the seasonal frequency. The Rob Hyndman book discusses Fourier terms in more detail if you would like to learn more:
https://otexts.com/fpp3/dhr.html
Let's begin by including 5 Fourier series terms in an AR model with order 1. The Fourier terms must be created outside of the ARIMA model in statsmodels and supplied as exogenous variables. The term exogenous means it's something "outside" the variable we are forecasting. Thus, we must generate features and include those features as if they were other "external inputs". We need to use the DeterministicProcess() function to create them.
from statsmodels.tsa.deterministic import DeterministicProcess
The terms are constructed using the DatetimeIndex of the series. The constant argument adds the INTERCEPT or BIAS column (named const).
fourier_terms_process = DeterministicProcess(sunspot_train.index, constant=True, fourier=5)
Let's take a look at a few of the exogenous features in the model. Five pairs of sine and cosine features are generated by the .in_sample() method.
fourier_terms_process.in_sample().head()
| const | sin(1,12) | cos(1,12) | sin(2,12) | cos(2,12) | sin(3,12) | cos(3,12) | sin(4,12) | cos(4,12) | sin(5,12) | cos(5,12) | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| date_dt | |||||||||||
| 1749-01-01 | 1.0 | 0.000000 | 1.000000e+00 | 0.000000e+00 | 1.0 | 0.000000e+00 | 1.000000e+00 | 0.000000e+00 | 1.0 | 0.000000 | 1.000000e+00 |
| 1749-02-01 | 1.0 | 0.500000 | 8.660254e-01 | 8.660254e-01 | 0.5 | 1.000000e+00 | 6.123234e-17 | 8.660254e-01 | -0.5 | 0.500000 | -8.660254e-01 |
| 1749-03-01 | 1.0 | 0.866025 | 5.000000e-01 | 8.660254e-01 | -0.5 | 1.224647e-16 | -1.000000e+00 | -8.660254e-01 | -0.5 | -0.866025 | 5.000000e-01 |
| 1749-04-01 | 1.0 | 1.000000 | 6.123234e-17 | 1.224647e-16 | -1.0 | -1.000000e+00 | -1.836970e-16 | -2.449294e-16 | 1.0 | 1.000000 | 3.061617e-16 |
| 1749-05-01 | 1.0 | 0.866025 | -5.000000e-01 | -8.660254e-01 | -0.5 | -2.449294e-16 | 1.000000e+00 | 8.660254e-01 | -0.5 | -0.866025 | -5.000000e-01 |
What do these features represent? Cyclic behavior at different frequencies. Higher frequency terms capture "faster dynamics" while lower frequency terms capture "slower moving" changes in the signals. The sines and cosines allow building a complex time series signal from simple waves!
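A minimal NumPy sketch of what we believe these columns contain (the fourier_features helper below is hypothetical, not a statsmodels function): the j-th sine/cosine pair completes j full waves every period observations.

```python
import numpy as np

def fourier_features(n_obs, period, k_terms):
    """Hypothetical helper: build k_terms sine/cosine pairs over n_obs
    evenly spaced steps, mirroring the DeterministicProcess columns."""
    t = np.arange(n_obs)
    cols = {}
    for j in range(1, k_terms + 1):
        cols['sin({},{})'.format(j, period)] = np.sin(2.0 * np.pi * j * t / period)
        cols['cos({},{})'.format(j, period)] = np.cos(2.0 * np.pi * j * t / period)
    return cols

feats = fourier_features(n_obs=24, period=12, k_terms=2)
# sin(1,12) completes one full wave every 12 steps; sin(2,12) every 6 steps.
```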
sns.relplot(data = fourier_terms_process.in_sample().tail(50), kind='line', aspect=3.5)
plt.show()
C:\Users\jyurk\anaconda3\envs\cmpinf2120\lib\site-packages\seaborn\axisgrid.py:118: UserWarning: The figure layout has changed to tight self._figure.tight_layout(*args, **kwargs)
Let's now fit our AR model with our "deterministic" features. We need to bring in the more general SARIMAX() function, which allows providing exogenous variables in addition to the endogenous observations.
from statsmodels.tsa.api import SARIMAX
sunspot_harmonic_ar_1 = SARIMAX( sunspot_train, order=(1, 0, 0), exog=fourier_terms_process.in_sample(),
enforce_stationarity=True).\
fit(method_kwargs={'maxiter':1501})
C:\Users\jyurk\anaconda3\envs\cmpinf2120\lib\site-packages\statsmodels\base\optimizer.py:18: FutureWarning: Keyword arguments have been passed to the optimizer that have no effect. The list of allowed keyword arguments for method lbfgs is: m, pgtol, factr, maxfun, epsilon, approx_grad, bounds, loglike_and_score, iprint. The list of unsupported keyword arguments passed include: method_kwargs. After release 0.14, this will raise. warnings.warn(
print( sunspot_harmonic_ar_1.summary() )
SARIMAX Results
==============================================================================
Dep. Variable: log_sunspot No. Observations: 2688
Model: SARIMAX(1, 0, 0) Log Likelihood -3400.524
Date: Tue, 30 Mar 2021 AIC 6827.047
Time: 14:27:46 BIC 6903.702
Sample: 01-01-1749 HQIC 6854.774
- 12-01-1972
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 3.2994 0.193 17.134 0.000 2.922 3.677
sin(1,12) -0.0193 0.048 -0.406 0.685 -0.113 0.074
cos(1,12) -0.0158 0.050 -0.313 0.755 -0.115 0.083
sin(2,12) -0.0032 0.026 -0.123 0.902 -0.054 0.048
cos(2,12) -0.0392 0.027 -1.459 0.145 -0.092 0.013
sin(3,12) 0.0251 0.019 1.343 0.179 -0.012 0.062
cos(3,12) -0.0158 0.019 -0.845 0.398 -0.052 0.021
sin(4,12) -0.0035 0.016 -0.225 0.822 -0.034 0.027
cos(4,12) -0.0058 0.015 -0.375 0.708 -0.036 0.024
sin(5,12) -0.0187 0.014 -1.329 0.184 -0.046 0.009
cos(5,12) -0.0245 0.014 -1.808 0.071 -0.051 0.002
ar.L1 0.8327 0.009 89.983 0.000 0.815 0.851
sigma2 0.7348 0.010 72.650 0.000 0.715 0.755
===================================================================================
Ljung-Box (L1) (Q): 220.23 Jarque-Bera (JB): 19606.77
Prob(Q): 0.00 Prob(JB): 0.00
Heteroskedasticity (H): 0.64 Skew: -1.29
Prob(H) (two-sided): 0.00 Kurtosis: 15.98
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
sunspot_harmonic_ar_1.plot_diagnostics(figsize=(16, 8))
plt.show()
In order to make forecasts, we must also forecast the exogenous Fourier terms! Thus, the "external features" must be created for future behavior. This is managed by the .out_of_sample() method. Forecasting 132 horizons into the future requires generating 132 horizons of "external features".
sunspots_harmonic_ar_1_forecast = sunspot_harmonic_ar_1.forecast(132, exog=fourier_terms_process.out_of_sample(132))
plt.figure( figsize=(15, 6) )
plt.plot( sunspot_ready, color='steelblue' )
plt.plot( sunspot_train, color='orange' )
plt.plot( sunspot_harmonic_ar_1.fittedvalues, color='black', linestyle='--')
plt.plot( sunspots_harmonic_ar_1_forecast, color='red' )
plt.show()
With Fourier terms we now have additional complexity to consider: the number of terms in the series! Let's try to TUNE the ARIMA order and the Fourier order by considering predictive performance on the hold-out set! We are therefore not using AIC to pick the best model, so it does not matter if higher order models reduce the training set size. The performance is assessed just on the hold-out test set!
Let's calculate the hold-out test MSE for the current AR-1 harmonic model.
sunspot_val = sunspot_ready.loc[ sunspot_ready.index >= '1973-01-01' ].copy()
np.mean( (sunspot_val.values - sunspots_harmonic_ar_1_forecast)**2 )
1.4430156860372458
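Note that the value printed above is the mean squared error (MSE); the RMSE is simply its square root. A tiny sketch with made-up numbers showing the relationship:

```python
import numpy as np

actual = np.array([70.0, 55.0, 60.0])      # hypothetical hold-out values
forecast = np.array([68.0, 58.0, 57.0])    # hypothetical forecasts

mse = np.mean((actual - forecast) ** 2)    # mean squared error
rmse = np.sqrt(mse)                        # root MSE, same units as the data
```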
Let's define a new set of ARIMA order values to try and, for simplicity, use a nested for-loop to handle iterating over the number of Fourier terms. We will mostly focus on the behavior of the Fourier terms with this search.
pdq_b = list( itertools.product([1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]) )
k_try = [2, 3, 4, 5, 6]
len(pdq_b)
48
For simplicity let's turn off the requirements of stationarity and invertibility.
pdq_results_b = []
k_results_b = []
mse_results_b = []
for k in k_try:
for param in pdq_b:
try:
my_fourier = DeterministicProcess(sunspot_train.index, constant=True, fourier=k)
mod_fit = SARIMAX( sunspot_train,
order=param,
enforce_invertibility=False,
enforce_stationarity=False,
exog=my_fourier.in_sample()).fit(method_kwargs={'maxiter':5001})
# specific to our hold out set
mod_forecast = mod_fit.forecast(132, exog=my_fourier.out_of_sample(132))
mse_value = np.mean( (sunspot_val.values - mod_forecast)**2 )
pdq_results_b.append(param)
k_results_b.append(k)
mse_results_b.append(mse_value)
print('ARIMA{} and k{} - MSE:{}'.format(param, k, mse_value))
except:
continue
ARIMA(1, 0, 0) and k2 - MSE:1.4445230073751414 ARIMA(1, 0, 1) and k2 - MSE:1.4511186848282016 ARIMA(1, 0, 2) and k2 - MSE:1.448209124474067 ARIMA(1, 0, 3) and k2 - MSE:1.4430166215417137 ARIMA(1, 1, 0) and k2 - MSE:1.0395872095015581 ARIMA(1, 1, 1) and k2 - MSE:1.0121464293995364 ARIMA(1, 1, 2) and k2 - MSE:1.0121555821282184 ARIMA(1, 1, 3) and k2 - MSE:1.0121309199784667 ARIMA(1, 2, 0) and k2 - MSE:22.983463950335707 ARIMA(1, 2, 1) and k2 - MSE:1.0493220021806882 ARIMA(1, 2, 2) and k2 - MSE:1.0174839405709561
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(1, 2, 3) and k2 - MSE:1.01805716452543 ARIMA(1, 3, 0) and k2 - MSE:400127.7571038682 ARIMA(1, 3, 1) and k2 - MSE:23.727128057632598 ARIMA(1, 3, 2) and k2 - MSE:10.738510567903958 ARIMA(1, 3, 3) and k2 - MSE:0.839771858054977 ARIMA(2, 0, 0) and k2 - MSE:1.449458029743459 ARIMA(2, 0, 1) and k2 - MSE:1.4432700445285849 ARIMA(2, 0, 2) and k2 - MSE:1.4568856136287163 ARIMA(2, 0, 3) and k2 - MSE:1.4481282360056902 ARIMA(2, 1, 0) and k2 - MSE:1.0307736484664716 ARIMA(2, 1, 1) and k2 - MSE:1.0121474359607137 ARIMA(2, 1, 2) and k2 - MSE:1.0121274749424232 ARIMA(2, 1, 3) and k2 - MSE:1.0124903836792174 ARIMA(2, 2, 0) and k2 - MSE:57.39904565416136 ARIMA(2, 2, 1) and k2 - MSE:1.0405796834940593
ARIMA(2, 2, 2) and k2 - MSE:1.0174089347624395 ARIMA(2, 2, 3) and k2 - MSE:1.0181963930024887 ARIMA(2, 3, 0) and k2 - MSE:128762.00786643232 ARIMA(2, 3, 1) and k2 - MSE:56.95657425272408
ARIMA(2, 3, 2) and k2 - MSE:0.8467720952542219 ARIMA(2, 3, 3) and k2 - MSE:0.846285016650618 ARIMA(3, 0, 0) and k2 - MSE:1.4587378833018037 ARIMA(3, 0, 1) and k2 - MSE:1.4464567537296826 ARIMA(3, 0, 2) and k2 - MSE:1.4430186407053738 ARIMA(3, 0, 3) and k2 - MSE:1.4453731120192659 ARIMA(3, 1, 0) and k2 - MSE:1.0224053149433001 ARIMA(3, 1, 1) and k2 - MSE:1.0125572130403704 ARIMA(3, 1, 2) and k2 - MSE:1.0124719107637932 ARIMA(3, 1, 3) and k2 - MSE:1.012173997740259 ARIMA(3, 2, 0) and k2 - MSE:91.36323769779355 ARIMA(3, 2, 1) and k2 - MSE:1.0281104707268456 ARIMA(3, 2, 2) and k2 - MSE:1.018305771304624 ARIMA(3, 2, 3) and k2 - MSE:1.017482077688096 ARIMA(3, 3, 0) and k2 - MSE:23681.636581435727 ARIMA(3, 3, 1) and k2 - MSE:96.11695064031143
ARIMA(3, 3, 2) and k2 - MSE:0.9523615408279947
ARIMA(3, 3, 3) and k2 - MSE:0.8635723281730276 ARIMA(1, 0, 0) and k3 - MSE:1.4459276521541022 ARIMA(1, 0, 1) and k3 - MSE:1.4528692069849567 ARIMA(1, 0, 2) and k3 - MSE:1.4452791245241745 ARIMA(1, 0, 3) and k3 - MSE:1.4447003832574445 ARIMA(1, 1, 0) and k3 - MSE:1.0366492994424696 ARIMA(1, 1, 1) and k3 - MSE:1.0134706149480017 ARIMA(1, 1, 2) and k3 - MSE:1.0134658149070999 ARIMA(1, 1, 3) and k3 - MSE:1.0135027771079417 ARIMA(1, 2, 0) and k3 - MSE:7.044411415564085 ARIMA(1, 2, 1) and k3 - MSE:1.0454013316734436 ARIMA(1, 2, 2) and k3 - MSE:1.0183907312749476 ARIMA(1, 2, 3) and k3 - MSE:1.0203513882965758 ARIMA(1, 3, 0) and k3 - MSE:627865.6399110267
ARIMA(1, 3, 1) and k3 - MSE:7.729326228692557 ARIMA(1, 3, 2) and k3 - MSE:0.8381524515010135 ARIMA(1, 3, 3) and k3 - MSE:0.841525998787084 ARIMA(2, 0, 0) and k3 - MSE:1.4510651589449468 ARIMA(2, 0, 1) and k3 - MSE:1.4450550543329754 ARIMA(2, 0, 2) and k3 - MSE:1.452263403051081 ARIMA(2, 0, 3) and k3 - MSE:1.4495387339700228 ARIMA(2, 1, 0) and k3 - MSE:1.0302789272649486 ARIMA(2, 1, 1) and k3 - MSE:1.0134706613928441 ARIMA(2, 1, 2) and k3 - MSE:1.0134678880521646 ARIMA(2, 1, 3) and k3 - MSE:1.0137041179939565 ARIMA(2, 2, 0) and k3 - MSE:42.925699481515494 ARIMA(2, 2, 1) and k3 - MSE:1.0394139573082724 ARIMA(2, 2, 2) and k3 - MSE:1.0185129136875084 ARIMA(2, 2, 3) and k3 - MSE:1.01925289373673 ARIMA(2, 3, 0) and k3 - MSE:217809.98268914357 ARIMA(2, 3, 1) and k3 - MSE:43.35204709780081 ARIMA(2, 3, 2) and k3 - MSE:0.8998705808723423 ARIMA(2, 3, 3) and k3 - MSE:0.8650288694960132 ARIMA(3, 0, 0) and k3 - MSE:1.4603919605972688 ARIMA(3, 0, 1) and k3 - MSE:1.4480884589366825 ARIMA(3, 0, 2) and k3 - MSE:1.4447170225818673
ARIMA(3, 0, 3) and k3 - MSE:1.4535231283932302 ARIMA(3, 1, 0) and k3 - MSE:1.0226178012350338 ARIMA(3, 1, 1) and k3 - MSE:1.013733363618924 ARIMA(3, 1, 2) and k3 - MSE:1.013669376231153 ARIMA(3, 1, 3) and k3 - MSE:1.013498010285982 ARIMA(3, 2, 0) and k3 - MSE:84.55186740682589 ARIMA(3, 2, 1) and k3 - MSE:1.0282520647738906
ARIMA(3, 2, 2) and k3 - MSE:1.0194515826977444 ARIMA(3, 2, 3) and k3 - MSE:1.0186844491637543 ARIMA(3, 3, 0) and k3 - MSE:39569.10351136195 ARIMA(3, 3, 1) and k3 - MSE:89.80187475250227
ARIMA(3, 3, 2) and k3 - MSE:0.9749049377060174
ARIMA(3, 3, 3) and k3 - MSE:0.9679117196748309 ARIMA(1, 0, 0) and k4 - MSE:1.445732702454074 ARIMA(1, 0, 1) and k4 - MSE:1.452530406158472 ARIMA(1, 0, 2) and k4 - MSE:1.4449507852414623 ARIMA(1, 0, 3) and k4 - MSE:1.4443678556171249 ARIMA(1, 1, 0) and k4 - MSE:1.0377817015536515 ARIMA(1, 1, 1) and k4 - MSE:1.013308257570981 ARIMA(1, 1, 2) and k4 - MSE:1.0133088274650786 ARIMA(1, 1, 3) and k4 - MSE:1.0133239650749306 ARIMA(1, 2, 0) and k4 - MSE:9.516833458026108 ARIMA(1, 2, 1) and k4 - MSE:1.0467330117713423 ARIMA(1, 2, 2) and k4 - MSE:1.018266047822489
ARIMA(1, 2, 3) and k4 - MSE:1.020482482014662 ARIMA(1, 3, 0) and k4 - MSE:581186.9377130528 ARIMA(1, 3, 1) and k4 - MSE:10.29441106774692 ARIMA(1, 3, 2) and k4 - MSE:10.371137191683191 ARIMA(1, 3, 3) and k4 - MSE:0.8432721681800508 ARIMA(2, 0, 0) and k4 - MSE:1.4508571228219724 ARIMA(2, 0, 1) and k4 - MSE:1.4447223662145416 ARIMA(2, 0, 2) and k4 - MSE:1.4518498730481826 ARIMA(2, 0, 3) and k4 - MSE:1.4493251703288677 ARIMA(2, 1, 0) and k4 - MSE:1.030803871545174 ARIMA(2, 1, 1) and k4 - MSE:1.0133116819799448 ARIMA(2, 1, 2) and k4 - MSE:1.0132997989950672 ARIMA(2, 1, 3) and k4 - MSE:1.0135910652936186 ARIMA(2, 2, 0) and k4 - MSE:45.57053545919488 ARIMA(2, 2, 1) and k4 - MSE:1.0399909486692795 ARIMA(2, 2, 2) and k4 - MSE:1.018399256212725 ARIMA(2, 2, 3) and k4 - MSE:1.0191088223245153 ARIMA(2, 3, 0) and k4 - MSE:202510.60688389343 ARIMA(2, 3, 1) and k4 - MSE:46.06936748276443 ARIMA(2, 3, 2) and k4 - MSE:0.9316841162973927 ARIMA(2, 3, 3) and k4 - MSE:0.8412098006858431 ARIMA(3, 0, 0) and k4 - MSE:1.4600873327458346 ARIMA(3, 0, 1) and k4 - MSE:1.4477477177599691 ARIMA(3, 0, 2) and k4 - MSE:1.444383894374377
ARIMA(3, 0, 3) and k4 - MSE:1.4729092774343662 ARIMA(3, 1, 0) and k4 - MSE:1.022928673924008 ARIMA(3, 1, 1) and k4 - MSE:1.013625188228366 ARIMA(3, 1, 2) and k4 - MSE:1.0135540416911288 ARIMA(3, 1, 3) and k4 - MSE:1.013320415986643 ARIMA(3, 2, 0) and k4 - MSE:86.86628578151156 ARIMA(3, 2, 1) and k4 - MSE:1.0286910690149391
ARIMA(3, 2, 2) and k4 - MSE:1.0194865880735942 ARIMA(3, 2, 3) and k4 - MSE:1.0186282891529692 ARIMA(3, 3, 0) and k4 - MSE:37309.02944610683 ARIMA(3, 3, 1) and k4 - MSE:92.32206230476334
ARIMA(3, 3, 2) and k4 - MSE:0.9535228016091104
ARIMA(3, 3, 3) and k4 - MSE:0.9657277171790375 ARIMA(1, 0, 0) and k5 - MSE:1.4465580287365682 ARIMA(1, 0, 1) and k5 - MSE:1.4525450656786265 ARIMA(1, 0, 2) and k5 - MSE:1.4447626873712076 ARIMA(1, 0, 3) and k5 - MSE:1.444307502998816 ARIMA(1, 1, 0) and k5 - MSE:1.042870130520318 ARIMA(1, 1, 1) and k5 - MSE:1.0139808125032006 ARIMA(1, 1, 2) and k5 - MSE:1.0139922114115885 ARIMA(1, 1, 3) and k5 - MSE:1.0139442356686645 ARIMA(1, 2, 0) and k5 - MSE:19.498006823833478 ARIMA(1, 2, 1) and k5 - MSE:1.0523848688490076 ARIMA(1, 2, 2) and k5 - MSE:1.0191113827592824 ARIMA(1, 2, 3) and k5 - MSE:1.0215065712801696 ARIMA(1, 3, 0) and k5 - MSE:403106.79244473914 ARIMA(1, 3, 1) and k5 - MSE:19.996232311167805
ARIMA(1, 3, 2) and k5 - MSE:0.8320041219676891
ARIMA(1, 3, 3) and k5 - MSE:0.8349399247624977 ARIMA(2, 0, 0) and k5 - MSE:1.451240358966964 ARIMA(2, 0, 1) and k5 - MSE:1.4445964980851285 ARIMA(2, 0, 2) and k5 - MSE:1.4598767562329813 ARIMA(2, 0, 3) and k5 - MSE:1.4429934680120493 ARIMA(2, 1, 0) and k5 - MSE:1.0347526243954288 ARIMA(2, 1, 1) and k5 - MSE:1.0139807751769796 ARIMA(2, 1, 2) and k5 - MSE:1.0139342544900776 ARIMA(2, 1, 3) and k5 - MSE:1.0144700012340069 ARIMA(2, 2, 0) and k5 - MSE:57.809278081692895 ARIMA(2, 2, 1) and k5 - MSE:1.044413463173056 ARIMA(2, 2, 2) and k5 - MSE:1.0191741612081835 ARIMA(2, 2, 3) and k5 - MSE:1.0198781650434225 ARIMA(2, 3, 0) and k5 - MSE:159177.30295339794 ARIMA(2, 3, 1) and k5 - MSE:58.37106028033203
ARIMA(2, 3, 2) and k5 - MSE:0.8752295838986773 ARIMA(2, 3, 3) and k5 - MSE:0.8382591260930224 ARIMA(3, 0, 0) and k5 - MSE:1.4600913682563272 ARIMA(3, 0, 1) and k5 - MSE:1.4476652762374485 ARIMA(3, 0, 2) and k5 - MSE:1.4442594823190753
ARIMA(3, 0, 3) and k5 - MSE:1.463560711376413 ARIMA(3, 1, 0) and k5 - MSE:1.0255974285833156 ARIMA(3, 1, 1) and k5 - MSE:1.0145261235841878 ARIMA(3, 1, 2) and k5 - MSE:1.0144217303328134 ARIMA(3, 1, 3) and k5 - MSE:1.013999955606924 ARIMA(3, 2, 0) and k5 - MSE:100.36651563506749 ARIMA(3, 2, 1) and k5 - MSE:1.0319202066367923 ARIMA(3, 2, 2) and k5 - MSE:1.0207691970655295 ARIMA(3, 2, 3) and k5 - MSE:1.01990849814011 ARIMA(3, 3, 0) and k5 - MSE:21592.470832901035 ARIMA(3, 3, 1) and k5 - MSE:106.1984186355968
ARIMA(3, 3, 2) and k5 - MSE:0.9338880944813748
ARIMA(3, 3, 3) and k5 - MSE:0.9384582029575268 ARIMA(1, 0, 0) and k6 - MSE:1.4517046942095109 ARIMA(1, 0, 1) and k6 - MSE:1.454419479300038 ARIMA(1, 0, 2) and k6 - MSE:1.4544334900151306 ARIMA(1, 0, 3) and k6 - MSE:1.4463235023284535 ARIMA(1, 1, 0) and k6 - MSE:1.0391343328578015 ARIMA(1, 1, 1) and k6 - MSE:1.013916100504052 ARIMA(1, 1, 2) and k6 - MSE:1.0139278965107372 ARIMA(1, 1, 3) and k6 - MSE:1.0138898416128186 ARIMA(1, 2, 0) and k6 - MSE:1.785587531084384 ARIMA(1, 2, 1) and k6 - MSE:1.0424194379662224 ARIMA(1, 2, 2) and k6 - MSE:1.019508800618041
ARIMA(1, 2, 3) and k6 - MSE:1.0206736971768984 ARIMA(1, 3, 0) and k6 - MSE:700310.7404906161 ARIMA(1, 3, 1) and k6 - MSE:1.4722053981187604 ARIMA(1, 3, 2) and k6 - MSE:7.140644997051315 ARIMA(1, 3, 3) and k6 - MSE:0.8585061624396632 ARIMA(2, 0, 0) and k6 - MSE:1.4565354558394341 ARIMA(2, 0, 1) and k6 - MSE:1.4467200213373412 ARIMA(2, 0, 2) and k6 - MSE:1.46309474011483 ARIMA(2, 0, 3) and k6 - MSE:1.4450981328080335 ARIMA(2, 1, 0) and k6 - MSE:1.032923256064557 ARIMA(2, 1, 1) and k6 - MSE:1.013919844768972 ARIMA(2, 1, 2) and k6 - MSE:1.0138747021726073 ARIMA(2, 1, 3) and k6 - MSE:1.0143445549819696 ARIMA(2, 2, 0) and k6 - MSE:28.106914236328635 ARIMA(2, 2, 1) and k6 - MSE:1.039400245088305
ARIMA(2, 2, 2) and k6 - MSE:1.019548079582722 ARIMA(2, 2, 3) and k6 - MSE:1.020354156549059 ARIMA(2, 3, 0) and k6 - MSE:310626.84663612465 ARIMA(2, 3, 1) and k6 - MSE:15.975406525149072 ARIMA(2, 3, 2) and k6 - MSE:0.8804474385016599 ARIMA(2, 3, 3) and k6 - MSE:0.8647024094720472 ARIMA(3, 0, 0) and k6 - MSE:1.466013642327508 ARIMA(3, 0, 1) and k6 - MSE:1.4499614981652935 ARIMA(3, 0, 2) and k6 - MSE:1.446279869791077
ARIMA(3, 0, 3) and k6 - MSE:1.507204893273465 ARIMA(3, 1, 0) and k6 - MSE:1.024698632915123 ARIMA(3, 1, 1) and k6 - MSE:1.0144163720923582 ARIMA(3, 1, 2) and k6 - MSE:1.0143181651051922 ARIMA(3, 1, 3) and k6 - MSE:1.013921154905784 ARIMA(3, 2, 0) and k6 - MSE:70.2291437079569 ARIMA(3, 2, 1) and k6 - MSE:1.0295745938485363
ARIMA(3, 2, 2) and k6 - MSE:1.0207508550807995 ARIMA(3, 2, 3) and k6 - MSE:1.020077953377376 ARIMA(3, 3, 0) and k6 - MSE:57949.39637741124 ARIMA(3, 3, 1) and k6 - MSE:60.36438728142855
ARIMA(3, 3, 2) and k6 - MSE:0.9452932905296477 ARIMA(3, 3, 3) and k6 - MSE:0.9634955455585652
Let's compile all the results together into a DataFrame.
arima_fourier_results_df = pd.DataFrame({'p': [pdq_results_b[n][0] for n in range(len(pdq_results_b))],
'd': [pdq_results_b[n][1] for n in range(len(pdq_results_b))],
'q': [pdq_results_b[n][2] for n in range(len(pdq_results_b))],
'k': k_results_b,
'MSE': mse_results_b})
arima_fourier_results_df.sort_values(['MSE'], ascending=True)
| p | d | q | k | MSE | |
|---|---|---|---|---|---|
| 158 | 1 | 3 | 2 | 5 | 0.832004 |
| 159 | 1 | 3 | 3 | 5 | 0.834940 |
| 62 | 1 | 3 | 2 | 3 | 0.838152 |
| 175 | 2 | 3 | 3 | 5 | 0.838259 |
| 15 | 1 | 3 | 3 | 2 | 0.839772 |
| ... | ... | ... | ... | ... | ... |
| 12 | 1 | 3 | 0 | 2 | 400127.757104 |
| 156 | 1 | 3 | 0 | 5 | 403106.792445 |
| 108 | 1 | 3 | 0 | 4 | 581186.937713 |
| 60 | 1 | 3 | 0 | 3 | 627865.639911 |
| 204 | 1 | 3 | 0 | 6 | 700310.740491 |
240 rows × 5 columns
Let's fit the optimal harmonic ARIMA model and forecast the hold-out validation set.
fourier_tune_process = DeterministicProcess(sunspot_train.index, constant=True, fourier=5)
sunspot_harmonic_arima = SARIMAX( sunspot_train, order=(1, 3, 2), exog=fourier_tune_process.in_sample(),
enforce_stationarity=False, enforce_invertibility=False).\
fit(method_kwargs={'maxiter':10001})
sunspot_harmonic_arima_forecast = sunspot_harmonic_arima.forecast(132, exog=fourier_tune_process.out_of_sample(132))
Let's visualize the behavior.
plt.figure( figsize=(15, 6) )
plt.plot( sunspot_ready, color='steelblue' )
plt.plot( sunspot_train, color='orange' )
plt.plot( sunspot_harmonic_arima.fittedvalues, color='black', linestyle='--')
plt.plot( sunspot_harmonic_arima_forecast, color='red' )
plt.show()
As we can see above, even after all that tuning, our model is not great!
What can we do? Consider that we now know there's a cycle, and that the cycle repeats roughly every 11 years. So let's define a new periodicity for the Fourier terms to enable studying this long-term cycle. We will override the frequency with the period argument and set it to 132, corresponding to 11 years times 12 months. This allows us to consider very high order Fourier terms.
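To make the period argument concrete, each Fourier term is a sine/cosine pair whose k-th harmonic completes exactly k full oscillations every 132 monthly steps. A minimal NumPy sketch (the helper name `fourier_pair` is our own, not part of statsmodels):

```python
import numpy as np

period = 132  # 11 years x 12 months per year

def fourier_pair(t, k, period):
    # k-th harmonic: sin/cos terms completing k full cycles every `period` steps
    angle = 2.0 * np.pi * k * t / period
    return np.sin(angle), np.cos(angle)

t = np.arange(3 * period)  # three full 11-year cycles of monthly time steps
s5, c5 = fourier_pair(t, 5, period)

# the terms are exactly periodic in `period`, so a linear combination of many
# such pairs can represent a complicated repeating 11-year pattern
repeats = np.allclose(s5[:period], s5[period:2 * period])
```

DeterministicProcess builds exactly this kind of design matrix for us, stacking the requested number of sin/cos pairs as exogenous columns.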
fourier_long_process = DeterministicProcess(sunspot_train.index, period=132, constant=True, fourier=50)
Let's fit a simple ARMA model with these custom long-period exogenous Fourier terms.
sunspot_harmonic_long_arma = SARIMAX( sunspot_train, order=(1, 0, 1), exog=fourier_long_process.in_sample(),
enforce_stationarity=False, enforce_invertibility=False).\
fit(maxiter=10001)
sunspot_harmonic_long_arma.plot_diagnostics(figsize=(16, 8))
plt.show()
Let's forecast with our new model.
sunspot_harmonic_long_arma_forecast = sunspot_harmonic_long_arma.forecast(132, exog=fourier_long_process.out_of_sample(132))
plt.figure( figsize=(15, 6) )
plt.plot( sunspot_ready, color='steelblue' )
plt.plot( sunspot_train, color='orange' )
plt.plot( sunspot_harmonic_long_arma.fittedvalues, color='black', linestyle='--')
plt.plot( sunspot_harmonic_long_arma_forecast, color='red' )
plt.show()
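To go beyond the visual check, the improvement can be quantified with the same hold-out MSE used during tuning. A minimal self-contained sketch with synthetic stand-ins for the validation window and its forecast (in the notebook you would substitute the last 132 observations of `sunspot_ready` and the forecast series):

```python
import numpy as np
import pandas as pd

# synthetic stand-ins for the 132-step hold-out window and its forecast
idx = pd.date_range('1973-01-01', periods=132, freq='MS')
actual = pd.Series(np.sin(np.arange(132) * 2 * np.pi / 132), index=idx)
forecast = actual + 0.1  # a forecast with a constant bias of 0.1

# subtraction aligns on the shared DatetimeIndex; average the squared errors
mse = float(((actual - forecast) ** 2).mean())
```

Comparing this number against the best value from the earlier grid search (about 0.83) gives an objective sense of how much the long-period Fourier terms helped.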
As shown above, we're finally getting the right behavior! It's not perfect, and perhaps we can tune the Fourier order and the cycle period to get even better results. Let's extract the prediction interval to complete the exercise with our current model.
def make_forecast_with_fourier(my_mod, my_fourier, steps_ahead):
# generate the out-of-sample Fourier terms required as exogenous inputs
exog_fourier = my_fourier.out_of_sample(steps_ahead)
my_forecast = my_mod.forecast(steps_ahead, exog=exog_fourier)
# use the forecast's time range to request the full prediction object,
# which carries the confidence interval in addition to the mean
start_time = my_forecast.index.min()
final_time = my_forecast.index.max()
forecast_object = my_mod.get_prediction(start_time, final_time, exog=exog_fourier)
forecast_mean = forecast_object.predicted_mean
forecast_ci = forecast_object.conf_int()
# package the mean and interval bounds into a tidy DataFrame
res = pd.DataFrame({'forecast_mean': forecast_mean,
'forecast_lwr': forecast_ci.iloc[:, 0],
'forecast_upr': forecast_ci.iloc[:, 1]},
index=forecast_ci.index)
return res
sunspot_harmonic_long_arma_forecast_summary = make_forecast_with_fourier( sunspot_harmonic_long_arma, fourier_long_process, 132)
fig, ax = plt.subplots( figsize=(15, 6) )
ax.plot( sunspot_ready, color='steelblue' )
ax.plot( sunspot_train, color='orange' )
ax.fill_between( sunspot_harmonic_long_arma_forecast_summary.index,
sunspot_harmonic_long_arma_forecast_summary.forecast_lwr,
sunspot_harmonic_long_arma_forecast_summary.forecast_upr,
color = 'orange',
alpha=0.33)
ax.plot( sunspot_harmonic_long_arma_forecast_summary.forecast_mean, color='red' )
ax.plot( sunspot_harmonic_long_arma.fittedvalues, color='black', linestyle='--')
plt.show()
And back-transform to the original output space.
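The undo_log_transform() helper was defined earlier in the notebook. For reference, a plausible sketch, assuming the forward transform was log(y + offset) with a small offset so zero counts are valid (the notebook's actual definition may differ):

```python
import numpy as np

def log_transform(y, offset):
    # hypothetical forward transform: shift by a small offset so zeros are valid
    return np.log(np.asarray(y, dtype=float) + offset)

def undo_log_transform(x, offset):
    # hypothetical inverse: exponentiate, then remove the offset
    return np.exp(np.asarray(x, dtype=float)) - offset

y = np.array([0.0, 58.0, 85.0])  # raw sunspot counts, including a zero month
round_trip = undo_log_transform(log_transform(y, 0.05), 0.05)
```

The round trip recovers the original values, which is the property the back-transformed plot below relies on.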
fig, ax = plt.subplots( figsize=(15, 6) )
ax.plot( sunspot_df.date_dt, sunspot_df.Sunspots, color='steelblue' )
ax.plot( sunspot_train.index,
undo_log_transform(sunspot_train.values, 0.05), color='orange' )
ax.fill_between( sunspot_harmonic_long_arma_forecast_summary.index,
undo_log_transform(sunspot_harmonic_long_arma_forecast_summary.forecast_lwr, 0.05),
undo_log_transform(sunspot_harmonic_long_arma_forecast_summary.forecast_upr, 0.05),
color = 'orange',
alpha=0.33)
ax.plot( sunspot_harmonic_long_arma_forecast_summary.index,
undo_log_transform(sunspot_harmonic_long_arma_forecast_summary.forecast_mean, 0.05), color='red' )
ax.plot( sunspot_train.index,
undo_log_transform(sunspot_harmonic_long_arma.fittedvalues, 0.05),
color='black', linestyle='--')
plt.show()
This notebook introduced the key functions and modeling approaches for working with advanced ARIMA models for time series forecasting. ARIMA models include many tuning parameters that control the model behavior. An appropriate validation procedure is required to properly tune such models.